Text File | 1997-05-12 | 8KB | 168 lines
DeDupe (or Extract Unique Lines) V1.2
by John Augustine
DD (DeDupe) was written in Assembly for MS-DOS Systems.
DD is Simple to use. It Requires Color Graphics (CGA) or better, Runs on 8086
to Pentium PCs, and Does Not Need a Lot of Memory to Operate.
IMPORTANT: If the size of the File is Very Large, I recommend a Fast
Computer.
DD will Remove ALL (Exact) Duplicate Lines located Anywhere in the File. The
File Must be an ASCII File with Lines that are No Longer than 255 Characters.
Lines MUST End with a Carriage Return (CR) and Line Feed (LF) (they Usually do).
Also, DD has an Extract Unique Lines Feature and a File Viewer (View One or Many
Files in Text Mode in a Single Pass).
The Difference between "DeDupe" and "Extract Unique":
With "DeDupe", the "Original" Lines, located closer to the beginning of the
File, will go into the Created "DeDupped" File (".DDD" Extension), while the
Duplicate Lines, located "downstream", will be passed over (Ignored) or
(Optionally) go into another File referred to as the "Dupes" File (".DUP"
Extension). "Extract Unique" also Removes the "Original" Line whenever it has
a Duplicate, so Only Unique Lines (Lines without a Duplicate) will enter the
"Unique" File (".UUU" Extension).
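The two modes can be illustrated with a short Python sketch. This is only a
model of the behavior described above, not DD's actual Assembly code. The
function names and the sample lines are made up for illustration, and the
sketch does not model DD's default handling of CR/LF-only lines.

```python
from collections import Counter

def dedupe(lines):
    """Keep the first ("Original") copy of each line; later copies are dupes."""
    seen = set()
    kept, dupes = [], []
    for line in lines:
        if line in seen:
            dupes.append(line)   # would go to the ".DUP" file (optional)
        else:
            seen.add(line)
            kept.append(line)    # would go to the ".DDD" file
    return kept, dupes

def extract_unique(lines):
    """Keep only lines that occur exactly once in the whole file."""
    counts = Counter(lines)
    return [line for line in lines if counts[line] == 1]  # the ".UUU" file

sample = ["apple", "pear", "apple", "plum", "pear", "fig"]
kept, dupes = dedupe(sample)
unique = extract_unique(sample)

print(kept)    # ['apple', 'pear', 'plum', 'fig']
print(dupes)   # ['apple', 'pear']
print(unique)  # ['plum', 'fig']
```

Note that "apple" and "pear" survive "DeDupe" once each, but vanish entirely
under "Extract Unique".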
"Extract Unique" can be Very Useful. Here is One of Many possible scenarios.
You found out that someone was "Tampering" with a Report. Lines were deleted
from various parts of the File. New Lines were added at Different parts of the
File. If that wasn't Bad Enough, some lines were modified. The "Intruder"
didn't know that you have a Backup of the File. Now you want to know what
Lines were added, deleted, or modified. This is Very Difficult if the File has
several thousand Lines. A side by side comparison is Difficult if the Modified
File had Lines Removed or Added, which shifts the rest of the Lines Up or Down
at Various "Points" throughout the File.
Simply Merge (Join) one File with the other using Dos's Copy Command:
"Copy File1+File2 NewFile" (without quotes)
"File1" is the Original File, "File2" is the "Tampered" File, and "NewFile"
is the Two Files Merged into One. Note: The Original File could be "File2"
instead of "File1" and vice versa. The outcome will still be the Same. Select
"Extract Unique Lines" from the DD Menu to Create a File with Only Unique Lines:
New Lines added to the Tampered File are "Unique"; Deleted Lines Exist Only in
the Original File, which makes them "Unique"; and each Modified Line is
"Unique", as is its Original version in the Original File.
In other words, all the Un-modified Lines that Exist in Both the Original File
and the "Tampered" File will "Cancel" each other out, leaving only the "Unique"
Lines for the Created "Unique" File.
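The merge-then-extract idea can be sketched in Python as follows. The report
contents are invented for illustration. Tagging each Unique Line with the File
it came from is an extra step added here for clarity; it is not described as
a DD feature. The sketch also assumes neither File contains internal
Duplicates, since those would "Cancel" as well.

```python
from collections import Counter

def extract_unique(lines):
    """Return the lines that occur exactly once (order preserved)."""
    counts = Counter(lines)
    return [line for line in lines if counts[line] == 1]

# Hypothetical backup and tampered versions of a report.
original = ["Total: 100", "Item A: 40", "Item B: 60", "Approved by: J.A."]
tampered = ["Total: 100", "Item A: 45", "Item B: 60", "Item C: 5"]

# Equivalent of "Copy File1+File2 NewFile": one file after the other.
merged = original + tampered
unique = extract_unique(merged)

# Extra step (not a DD feature): say where each unique line came from.
original_set = set(original)
only_in_original = [l for l in unique if l in original_set]      # deleted or old text
only_in_tampered = [l for l in unique if l not in original_set]  # added or new text

print(only_in_original)  # ['Item A: 40', 'Approved by: J.A.']
print(only_in_tampered)  # ['Item A: 45', 'Item C: 5']
```

The Modified Line shows up twice, once in each form, while the Deleted and
Added Lines show up once each.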
GETTING STARTED:
Type (without quotes) "DEDUPE FileName.Ext" and press Enter, or
Type "DEDUPE" and press Enter. You can Enter the File Name Later.
LARGE FILES (Several Thousand Lines) TAKE "TIME":
See "TECH.DOC" for the Details on why it takes "Time" when "DeDupping" a
Large File. In short, the reason has to do with the number of Line Comparisons
involved, which is an Astronomical Number for Large Files. Note: Some
"Un-Dupe" Utilities require a Sorted File. Removing Duplicate Lines from a
Sorted File (using another "Un-Dupe" Utility) is Very Simple and Very Fast.
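To get a feel for the numbers, here is a rough Python sketch. It assumes a
naive approach in which every Line may be compared against every earlier Line;
that is an assumption made for illustration, and "TECH.DOC" remains the
reference for what DD actually does. The second function shows why a Sorted
File is easy: Duplicates sit next to each other, so one pass is enough.

```python
def pairwise_comparisons(n):
    """Worst-case comparisons if each line is checked against all earlier ones."""
    return n * (n - 1) // 2

def dedupe_sorted(sorted_lines):
    """Remove duplicates from an already sorted list in a single pass."""
    out = []
    for line in sorted_lines:
        # In sorted input a duplicate can only be the line just before it.
        if not out or out[-1] != line:
            out.append(line)
    return out

print(pairwise_comparisons(5000))     # 12497500
print(pairwise_comparisons(520000))   # 135199740000
print(dedupe_sorted(["a", "a", "b", "c", "c", "c"]))  # ['a', 'b', 'c']
```

Sorting changes the Line order, of course, which is why a tool that works on
Unsorted Files has to do more work.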
POSSIBLE PROBLEMS "DEDUPPING" A FILE:
1. CR/LF Only Lines used for Spacing:
DD Defaults to Ignoring Lines that consist Only of CR/LF, which are used for
Spacing between Paragraphs, etc. You can Toggle this Off (Don't Ignore) and
Remove those (Duplicate) CR/LF Lines (Except the First "Original" one). Here
is an Example of why you may want those CR/LF Lines Removed.
Input File (In Part) ("<CR>" represents Both CR and LF):
<CR>
I have a Message for you. My Subject will be Little Bo Peep.
It says here, Little Bo Peep has lost her sheep.
<CR>
I have a Message for you. My Subject will be Little Bo Peep.
It says here, Little Bo Peep has lost her sheep.
<CR>
Bla Bla Bla ....
Created "DeDupped" File.DDD:
<CR>
I have a Message for you. My Subject will be Little Bo Peep.
It says here, Little Bo Peep has lost her sheep.
<CR>
<CR>
Bla Bla Bla ....
2. Lines that are used as Markers or Separators (Example: "---------------")
throughout the File (which are Duplicates). Removing those Separators could
make the Document more Difficult to "follow". DD is Not designed to Ignore
other Patterns (Only CR/LF).
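The effect of the CR/LF toggle in Problem 1 can be modeled with a small Python
sketch. The "ignore_blank" parameter is a hypothetical stand-in for DD's menu
toggle, and an empty string stands for a CR/LF-only Line. This is a model of
the described behavior only, not DD's code.

```python
def dedupe(lines, ignore_blank=True):
    """Drop later copies of each line; optionally let blank lines pass through."""
    seen = set()
    kept = []
    for line in lines:
        if line == "" and ignore_blank:
            kept.append(line)      # spacing lines are never treated as dupes
            continue
        if line in seen:
            continue               # a duplicate located "downstream"
        seen.add(line)
        kept.append(line)
    return kept

msg1 = "I have a Message for you. My Subject will be Little Bo Peep."
msg2 = "It says here, Little Bo Peep has lost her sheep."
text = ["", msg1, msg2, "", msg1, msg2, "", "Bla Bla Bla ...."]

default = dedupe(text)                      # blank lines ignored (DD's default)
toggled = dedupe(text, ignore_blank=False)  # blank lines deduped too

print(default == ["", msg1, msg2, "", "", "Bla Bla Bla ...."])  # True
print(toggled == ["", msg1, msg2, "Bla Bla Bla ...."])          # True
```

With the default, two spacing Lines end up back to back, as in the example
under Problem 1. With the toggle Off, only the First spacing Line remains.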
IMPORTANT NOTES:
There is an Option (In DD's Sub-Menu) available for you to Create another
File for All the Duplicate Lines (Good for Reference) during the "DeDupe"
Process. If there are No Duplicate Lines in the File, the ".DUP" File will
have 0 Bytes. The Same thing happens (0 Byte File) with "Extract Unique" for
Two Merged Files which are Exactly the Same (No Unique Lines). One Exception:
If Both Files had an EOF Marker Character, Dos will eliminate the EOF Marker of
the First File during the "Merging" Process, and the Merged File will have a
Single EOF Marker at the End. This EOF Character will become the Only "Unique"
Line in the Created "Unique" File when Both Files are Exact Duplicates with
an EOF Character at the End.
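The EOF case can be reproduced with a simplified Python model. It assumes the
EOF Marker is the Ctrl-Z byte (hex 1A) and that the Merge behaves as described
above: each source is read up to its EOF Marker and a Single Marker is written
at the End. The function names and file contents are made up for illustration.

```python
from collections import Counter

EOF = b"\x1a"  # Ctrl-Z, assumed here to be the EOF Marker Character

def dos_copy_merge(first, second):
    """Simplified model of merging two text files with "Copy File1+File2"."""
    def strip_eof(data):
        cut = data.find(EOF)
        return data if cut == -1 else data[:cut]
    # Each source contributes its text only; one marker ends the result.
    return strip_eof(first) + strip_eof(second) + EOF

def extract_unique(data):
    """Split on CR/LF and keep the pieces that occur exactly once."""
    lines = data.split(b"\r\n")
    counts = Counter(lines)
    return [line for line in lines if counts[line] == 1]

report = b"alpha\r\nbeta\r\n" + EOF
merged = dos_copy_merge(report, report)

print(merged)                  # b'alpha\r\nbeta\r\nalpha\r\nbeta\r\n\x1a'
print(extract_unique(merged))  # [b'\x1a']
```

Every real Line Cancels against its twin, and the lone EOF Character is the
only thing left.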
TEST DD YOURSELF:
Note: I included "TEST1.TXT" and "TEST2.TXT" Files. "TEST2.TXT" has several
Duplicate lines throughout. Use DD's File Viewer to Look at both Files, using
"TEST*.TXT" for the File Name (Wildcard to see both Files in One Pass).
1. Select "DeDupe" at the Main Menu and "DeDupe" (Remove Duplicate Lines)
"TEST2.TXT" (Creates another File with .DDD (Default)).
2. Now, back at DD's Main Menu, select View File/s again.
3. Enter (without quotes) "TEST2.DDD" for the File Name to see the File
Created without any Duplicate Lines. "TEST2.DDD" will be Exactly the same as
"TEST1.TXT" (Reference File for Comparison).
IMPORTANT NOTE:
If you make your own "Test" File with Duplicates, the Duplicate Lines MUST
be Added "Downstream" from their Original location in the File.
After you "DeDupe" your own Test File, and if the ".DDD" File is Not a
Perfect Match to your Original Reference File, use DD's File Viewer and go to
the End (Press End Key) and Check the EOF (End of File) Character. You may
have "Pasted" a Duplicate Line at the End, which Possibly (Depends on the
Editor) Indents the EOF Character after the Last Line (Mine does). That is the
Reason the Two Files are not a Perfect Match. View the End of your Reference
File for comparison.
DD LIMITATIONS:
There is No Limit to the Size (in Bytes) of the File that DD can Handle. There
is, however, a Limit on the number of Lines in the File and on the number of
Characters (255 Maximum) in Each Line. Don't "DeDupe" a File with more than
520,000 Lines. It is Very Unlikely that you have a File this Big unless you own
a Large Company. Note: If the Average Line Length is 60 Characters per Line,
that would be almost 32 Megabytes!
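The File Size estimate can be checked with a few lines of Python. Whether the
60 Character average includes the 2 Byte CR/LF at the end of each Line is not
stated, so both readings are shown; the variable names are illustrative.

```python
MAX_LINES = 520_000   # line limit stated above
AVG_CHARS = 60        # average characters per line from the note above
CRLF_BYTES = 2        # carriage return + line feed ending each line

text_bytes = MAX_LINES * AVG_CHARS                  # text only
total_bytes = MAX_LINES * (AVG_CHARS + CRLF_BYTES)  # text plus line endings

print(text_bytes)   # 31200000
print(total_bytes)  # 32240000
```

Either way the result is in the neighborhood of 32 million Bytes.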
FINAL COMMENTS:
Any Comments, Complaints, or Suggestions are ALWAYS Welcome. For any questions,
please include a Self Addressed Stamped Envelope, or send me E-Mail.
A Small Donation for All my Work will be GREATLY APPRECIATED and will Motivate
me to take on other Projects that may be Beneficial to you. If you have a
question, don't feel obligated to make a donation in order to get an Answer.
E-Mail: john.augustine@gmiibbs.com
John Augustine N3AOF
3129 Earl St.
Laureldale, Pa 19605
(610) 929-8850